feat: generalize ChunkManifest to hold inline chunks by maxrjones · Pull Request #938 · zarr-developers/VirtualiZarr

maxrjones · 2026-03-20T02:19:25Z

This implements the design proposed in #851 (comment) for supporting native chunks in ChunkManifest, which allows loading Kerchunk JSONs with in-lined references or eventually Icechunk stores.

I prefer this design over #794, but I still don't really like the number of `if native` branches it needs. Still, this feature is so long overdue that it seems worth releasing soon even if the design isn't perfect. We can improve the internals later. Let's discuss at the dev meeting tomorrow.

Adds _native: dict[tuple[int, ...], bytes] to ChunkManifest for storing in-memory chunk data
Extends ChunkEntry with optional data: NotRequired[bytes] field
Native chunks propagate through concat/stack/broadcast, pickle, and ManifestStore.get()
Kerchunk parser decodes base64 inlined refs into native chunks
Writers serialize native chunks: base64 for kerchunk, store.set() for icechunk
Docs page + tests including a kerchunk→icechunk→xarray roundtrip
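
As a rough illustration of the data model listed above (not the VirtualiZarr source), the extended entry type and the in-memory mapping could look like the sketch below. The field names `path`, `offset`, `length`, and `data` come from the description; the example path, the bytes, and the use of `"__inlined__"` as the path of a native entry (a sentinel that only appears later in this thread) are assumptions.

```python
from typing import TypedDict

try:  # NotRequired is in the standard library from Python 3.11
    from typing import NotRequired
except ImportError:  # older interpreters need the typing_extensions backport
    from typing_extensions import NotRequired


class ChunkEntry(TypedDict):
    """One chunk reference; `data` is present only for in-memory chunks."""

    path: str
    offset: int
    length: int
    data: NotRequired[bytes]


# A virtual reference: bytes live in some file at [offset, offset + length).
virtual_entry: ChunkEntry = {"path": "s3://bucket/file.nc", "offset": 100, "length": 40}

# A native chunk: the bytes themselves travel with the entry (path is a placeholder).
native_entry: ChunkEntry = {
    "path": "__inlined__",
    "offset": 0,
    "length": 4,
    "data": b"\x00\x01\x02\x03",
}

# In-memory chunk data keyed by chunk grid index, as in `_native`.
native: dict[tuple[int, ...], bytes] = {(0, 1): native_entry["data"]}
```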

Acceptance criteria:

Closes Generalize ChunkManifest to hold native chunks as well as virtual refs #851
Tests added
Tests passing
No test coverage regression
Full type hint coverage
Changes are documented in docs/releases.md
New functions/methods are listed in an appropriate *.md file under docs/api
New functionality has documentation

codecov · 2026-03-20T05:19:42Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.91%. Comparing base (e82ac27) to head (de587a9).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #938      +/-   ##
==========================================
+ Coverage   89.35%   89.91%   +0.56%     
==========================================
  Files          33       33              
  Lines        2038     2053      +15     
==========================================
+ Hits         1821     1846      +25     
+ Misses        217      207      -10

| Files with missing lines | Coverage Δ |
| --- | --- |
| virtualizarr/manifests/array_api.py | `98.18% <100.00%> (+0.56%)` ⬆️ |
| virtualizarr/manifests/manifest.py | `92.56% <100.00%> (+6.38%)` ⬆️ |
| virtualizarr/manifests/store.py | `89.72% <100.00%> (+0.44%)` ⬆️ |


TomNicholas

This is awesome, thanks @maxrjones .

My main concern about this PR is that it changes private internals and public behaviour in one go. The neater way to do it would be to leave any changes to the Kerchunk and Icechunk readers/writers to follow-up PRs.

TomNicholas · 2026-03-20T15:49:14Z

    _paths: np.ndarray[Any, np.dtypes.StringDType]
    _offsets: np.ndarray[Any, np.dtype[np.uint64]]
    _lengths: np.ndarray[Any, np.dtype[np.uint64]]
+    _native: dict[tuple[int, ...], bytes]


How is a scalar native array handled? An empty tuple? Worth having a test for that case.

Added a test in 8fc3de3 (this PR), and yes: https://github.com/maxrjones/VirtualiZarr/blob/8fc3de361e244329b3358abdcc7ba58b3627a5f6/virtualizarr/tests/test_manifests/test_manifest.py#L520-L521
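
To make the scalar case concrete: a zero-dimensional chunk grid has exactly one chunk, and its index is the empty tuple. The helper `chunk_keys` below is hypothetical and only illustrates why `()` is the natural dictionary key; it is not taken from the linked test.

```python
from itertools import product


def chunk_keys(chunk_grid_shape: tuple[int, ...]) -> list[tuple[int, ...]]:
    """Enumerate every chunk index for a chunk grid of the given shape."""
    return list(product(*(range(n) for n in chunk_grid_shape)))


# A scalar array has an empty chunk grid shape, so its only key is ().
scalar_keys = chunk_keys(())
native = {key: b"\x2a" for key in scalar_keys}
```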

TomNicholas · 2026-03-20T16:14:12Z

I'm also not totally clear on the intended relationship between this feature and the parsers. Do parsers choose if a chunk is native? ¹ IIUC the user has no way to use this feature via loadable_variables? I think that makes sense, but trying to confirm my understanding.

(Also I think we should actually call this "inlined", because it is effectively stored in our ChunkManifest, so that nomenclature would be more consistent with Icechunk and Kerchunk.) ↩

maxrjones · 2026-03-20T19:20:24Z

Thanks for the review @TomNicholas. I addressed all your comments in the last set of commits.

My main concern about this PR is that it changes private internals and public behaviour in one go. The more neat way to do it would be to leave any changes to the Kerchunk and Icechunk readers/writers to follow-up PRs.

Good point, I pulled out the Kerchunk reader portion. I think the writers need to know what to do with inlined data immediately, otherwise you could get buggy serialized data if someone were to write a parser that uses the inlined data feature.

I'm also not totally clear on the intended relationship between this feature and the parsers. Do parsers choose if a chunk is native? ¹ IIUC the user has no way to use this feature via loadable_variables? I think that makes sense, but trying to confirm my understanding.

That's right. Parsers choose which chunks are inlined. The user chooses which arrays are loaded via loadable_variables. Parsers could offer configuration options for per-chunk inlining, but probably won't.

TomNicholas · 2026-03-20T21:03:50Z

Good point, I pulled out the Kerchunk reader portion. I think the writers need to know what to do with inlined data immediately, otherwise you could get buggy serialized data if someone were to write a parser that uses the inlined data feature.

I mean you could just update the writers to raise NotImplementedError if there are any native chunks, but the splitting you've already done is good, thank you.

Parsers chose which chunks are inlined.

Cool. So it is really an implementation detail. It's only relevant for Parser authors. So doesn't that mean the docs for it should only be under "Writing a custom parser"?

TomNicholas · 2026-03-24T20:47:34Z

@maxrjones the changes don't look right any more - where is the new attribute on chunk manifest for holding the buffer?

Also please:

remove any changes to the Icechunk writer
move the docs into the custom parsers section

maxrjones · 2026-03-24T20:52:16Z

@maxrjones the changes don't look right any more - where is the new attribute on chunk manifest for holding the buffer?

https://github.com/maxrjones/VirtualiZarr/blob/922335028e624c0fe4d9a49aa241d45cffb6195b/virtualizarr/manifests/manifest.py#L214

Also please:

remove any changes to the Icechunk writer

move the docs into the custom parsers section

will do

TomNicholas · 2026-03-24T21:11:58Z

Ah sorry. We also need to test other basic operations on manifests in this PR, such as concatenation, broadcasting, and any slicing we support. Might be easier to change fixtures of existing tests than to add new ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Broadcast should replicate inlined chunk bytes to every position along an expanded axis, matching the behaviour already observed for virtual chunks. Three of the four new tests fail under the current implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously _broadcast_manifest only prepended singleton dimensions to inlined chunk keys, leaving a single dict entry even when np.broadcast_to expanded an axis. Reads at the replicated positions would find the INLINED_CHUNK_PATH sentinel in the paths array but miss the _inlined dict, producing broken behaviour in ManifestStore.get. Now we replicate each inlined entry to every target position along any axis that was size 1 in the source, mirroring how the paths/offsets/lengths arrays are broadcast. The bytes themselves are shared by reference, not copied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
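
A minimal sketch of the replication described in this commit message, written against plain dictionaries rather than the real `_broadcast_manifest`. The function name `broadcast_inlined` and its signature are hypothetical. It loops over every target position, which keeps the example short but would be slow for large chunk grids.

```python
from itertools import product

Key = tuple[int, ...]


def broadcast_inlined(
    inlined: dict[Key, bytes], src_shape: Key, dst_shape: Key
) -> dict[Key, bytes]:
    """Replicate inlined entries to every position of a broadcast chunk grid."""
    ndim_added = len(dst_shape) - len(src_shape)
    if ndim_added < 0:
        raise ValueError("cannot broadcast to fewer dimensions")
    # Prepend singleton dimensions, as NumPy-style broadcasting does.
    padded_src = (1,) * ndim_added + tuple(src_shape)
    for s, d in zip(padded_src, dst_shape):
        if s != d and s != 1:
            raise ValueError(f"shape {src_shape} is not broadcastable to {dst_shape}")
    padded = {(0,) * ndim_added + key: data for key, data in inlined.items()}

    out: dict[Key, bytes] = {}
    for dst_key in product(*(range(n) for n in dst_shape)):
        # Along an axis that was size 1 in the source, every target index maps to 0.
        src_key = tuple(0 if s == 1 else i for i, s in zip(dst_key, padded_src))
        if src_key in padded:
            out[dst_key] = padded[src_key]  # same bytes object, no copy
    return out


payload = b"abc"
result = broadcast_inlined({(0, 1): payload}, src_shape=(1, 2), dst_shape=(3, 2))
```

Here the single entry at `(0, 1)` ends up at `(0, 1)`, `(1, 1)`, and `(2, 1)`, and all three values are the same `bytes` object.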

Locks in the existing behaviour of _concat_manifests and _stack_manifests for manifests containing inlined chunks: keys are shifted along the concat axis or gain the stack-axis index, and bytes are shared by reference rather than copied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
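
The key arithmetic described here can be sketched as follows. `concat_inlined` and `stack_inlined` are hypothetical helpers operating on plain dictionaries, not the real `_concat_manifests` and `_stack_manifests`; the chunk grid shapes are passed in explicitly as an assumption of this sketch.

```python
Key = tuple[int, ...]


def concat_inlined(
    manifests: list[tuple[dict[Key, bytes], Key]], axis: int
) -> dict[Key, bytes]:
    """Merge inlined dicts, shifting keys along `axis` by the preceding grid sizes.

    Each item is (inlined_dict, chunk_grid_shape).
    """
    out: dict[Key, bytes] = {}
    offset = 0
    for inlined, shape in manifests:
        for key, data in inlined.items():
            shifted = key[:axis] + (key[axis] + offset,) + key[axis + 1:]
            out[shifted] = data  # shared by reference
        offset += shape[axis]
    return out


def stack_inlined(inlined_dicts: list[dict[Key, bytes]], axis: int) -> dict[Key, bytes]:
    """Merge inlined dicts, inserting each manifest's position as a new axis index."""
    out: dict[Key, bytes] = {}
    for position, inlined in enumerate(inlined_dicts):
        for key, data in inlined.items():
            out[key[:axis] + (position,) + key[axis:]] = data
    return out


a, b = b"a", b"b"
concatenated = concat_inlined([({(0, 0): a}, (1, 2)), ({(0, 1): b}, (1, 2))], axis=0)
stacked = stack_inlined([{(0,): a}, {(1,): b}], axis=0)
```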

Confirms replicated entries share the same bytes object rather than allocating copies at each expanded position. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When two ManifestArrays share paths/offsets/lengths but have different inlined chunk data, ManifestArray.__eq__ falls through to its 'over-cautious' fallback via ChunkManifest.elementwise_eq, which does not currently compare inlined bytes. That triggers RuntimeWarning('Should not be possible to get here'). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously elementwise_eq only looked at paths/offsets/lengths, which all agree for inlined chunks even when their bytes differ. That let two ChunkManifests disagree per __eq__ but look identical per elementwise_eq, tripping the 'Should not be possible to get here' branch in ManifestArray.__eq__. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
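
To illustrate why the reference fields alone are not enough, here is a simplified stand-in for the comparison. The `Manifest` tuple and this `elementwise_eq` function are hypothetical and use dictionaries instead of the NumPy arrays in `ChunkManifest`; the point is only that two inlined chunks can agree on path, offset, and length while holding different bytes.

```python
from typing import NamedTuple

Key = tuple[int, ...]


class Manifest(NamedTuple):
    entries: dict[Key, tuple[str, int, int]]  # key -> (path, offset, length)
    inlined: dict[Key, bytes]  # key -> in-memory chunk bytes


def elementwise_eq(a: Manifest, b: Manifest) -> dict[Key, bool]:
    """Compare two manifests chunk by chunk, including inlined bytes."""
    if a.entries.keys() != b.entries.keys():
        raise ValueError("manifests must cover the same chunk grid")
    result: dict[Key, bool] = {}
    for key in a.entries:
        refs_match = a.entries[key] == b.entries[key]
        # For virtual chunks both lookups return None, so this is a no-op.
        bytes_match = a.inlined.get(key) == b.inlined.get(key)
        result[key] = refs_match and bytes_match
    return result


entries = {(0, 0): ("__inlined__", 0, 3), (0, 1): ("s3://bucket/file.nc", 10, 5)}
left = Manifest(entries, {(0, 0): b"abc"})
right = Manifest(entries, {(0, 0): b"xyz"})
comparison = elementwise_eq(left, right)
```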

Covers the inlined-chunk branch in ManifestStore.get including byte-range variants (RangeByteRequest, OffsetByteRequest, SuffixByteRequest), a mixed manifest where inlined and virtual chunks are served from the same array, and list_dir enumeration of inlined chunk keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
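
For readers unfamiliar with the three byte-range variants, the sketch below shows how each could be applied to an inlined chunk's bytes. The three dataclasses are local stand-ins that mirror my understanding of the request types in `zarr.abc.store` (field names `start`/`end`, `offset`, `suffix`); they are defined here so the example does not depend on zarr, and `slice_inlined` is a hypothetical helper rather than the code in `ManifestStore.get`.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class RangeByteRequest:
    start: int  # inclusive
    end: int  # exclusive


@dataclass(frozen=True)
class OffsetByteRequest:
    offset: int  # read from this byte to the end


@dataclass(frozen=True)
class SuffixByteRequest:
    suffix: int  # read the last `suffix` bytes


ByteRequest = Union[RangeByteRequest, OffsetByteRequest, SuffixByteRequest]


def slice_inlined(data: bytes, byte_range: Optional[ByteRequest] = None) -> bytes:
    """Return the requested portion of an in-memory chunk."""
    if byte_range is None:
        return data
    if isinstance(byte_range, RangeByteRequest):
        return data[byte_range.start:byte_range.end]
    if isinstance(byte_range, OffsetByteRequest):
        return data[byte_range.offset:]
    if isinstance(byte_range, SuffixByteRequest):
        # data[-0:] would return everything, so handle a zero-length suffix explicitly.
        return data[-byte_range.suffix:] if byte_range.suffix else b""
    raise TypeError(f"unsupported byte range: {byte_range!r}")


chunk = b"0123456789"
```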

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

for more information, see https://pre-commit.ci

TomNicholas · 2026-04-23T19:33:14Z

@maxrjones I hope you don't mind but I took over this PR.

I responded to my own review by making these changes:

Reverted all Icechunk writer changes — kept strictly to the ChunkManifest internals
Docs moved out of a standalone page into data_structures.md under ## Chunk Manifests
Tests added for concat / stack / broadcast of manifests with inlined chunks, plus ManifestStore reads (byte-range variants, mixed inlined/virtual, list_dir)
Fixed a broadcast bug: _broadcast_manifest wasn't replicating inlined entries along expanded axes — a (1,2) → (3,2) broadcast would leave the replicated positions missing from _inlined even though _paths claimed __inlined__ there. Now replicates by reference, no byte copies.
Fixed an equality bug: elementwise_eq didn't compare inlined bytes, which broke an invariant in ManifestArray.__eq__ and tripped RuntimeWarning("Should not be possible to get here") for two manifests with identical paths/offsets/lengths but different inlined bytes.

maxrjones · 2026-04-23T21:09:02Z

@maxrjones I hope you don't mind but I took over this PR.

I responded to my own review by making these changes:

Reverted all Icechunk writer changes — kept strictly to the ChunkManifest internals

Docs moved out of a standalone page into data_structures.md under ## Chunk Manifests

Tests added for concat / stack / broadcast of manifests with inlined chunks, plus ManifestStore reads (byte-range variants, mixed inlined/virtual, list_dir)

Fixed a broadcast bug: _broadcast_manifest wasn't replicating inlined entries along expanded axes — a (1,2) → (3,2) broadcast would leave the replicated positions missing from _inlined even though _paths claimed __inlined__ there. Now replicates by reference, no byte copies.

Fixed an equality bug: elementwise_eq didn't compare inlined bytes, which broke an invariant in ManifestArray.__eq__ and tripped RuntimeWarning("Should not be possible to get here") for two manifests with identical paths/offsets/lengths but different inlined bytes.

this is great, thank you very much for taking this on! Those changes are all very helpful. I haven't done a line-by-line review, but would be happy with you merging whenever you feel it's ready.

Validation previously used a subset check, which silently accepted entries with unknown keys alongside the required path/offset/length. Now the entry key set must match exactly one of the two valid shapes: virtual ({path, offset, length}) or inlined ({path, offset, length, data}). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
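
The difference between the old subset check and the exact-match check can be sketched like this. `validate_entry` and `old_subset_check` are hypothetical names, and the key `data` follows the wording of this commit message; the real validation code may differ.

```python
VIRTUAL_KEYS = frozenset({"path", "offset", "length"})
INLINED_KEYS = VIRTUAL_KEYS | {"data"}


def old_subset_check(entry: dict) -> bool:
    """The earlier behaviour: only require that the three core keys are present."""
    return VIRTUAL_KEYS <= entry.keys()


def validate_entry(entry: dict) -> None:
    """Require the key set to match exactly one of the two valid shapes."""
    keys = frozenset(entry)
    if keys not in (VIRTUAL_KEYS, INLINED_KEYS):
        raise ValueError(
            f"entry keys {sorted(keys)} must be exactly "
            f"{sorted(VIRTUAL_KEYS)} or {sorted(INLINED_KEYS)}"
        )


good_virtual = {"path": "s3://bucket/file.nc", "offset": 0, "length": 8}
good_inlined = {"path": "__inlined__", "offset": 0, "length": 2, "data": b"hi"}
bad = {"path": "s3://bucket/file.nc", "offset": 0, "length": 8, "typo_key": 1}
```

With these inputs, `old_subset_check(bad)` passes while `validate_entry(bad)` raises `ValueError`.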

Calls out the path-value convention used by ChunkManifest entries so parser authors have a single, discoverable reference for distinguishing virtual, missing, and inlined chunks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Broadcasting inlined chunks not only prepends singleton dims to their keys, but also replicates the bytes (by reference) across every position of an expanded axis, per the fix in zarr-developers#938. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#981)

* Add failing test for writing inlined chunks to icechunk

The icechunk writer currently sends the INLINED_CHUNK_PATH sentinel ('__inlined__') straight into icechunk's set_virtual_refs_arr, which rejects it as a malformed virtual URL. The new test writes a manifest containing one inlined chunk plus one virtual chunk, commits, then re-opens via xarray and asserts the values match end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Support writing inlined ChunkManifest entries to icechunk

Icechunk's set_virtual_refs_arr rejects the INLINED_CHUNK_PATH sentinel ('__inlined__') as a malformed URL. write_manifest_to_icechunk now writes inlined chunks first as native chunks via store.set, then rewrites those positions to empty strings in the paths array before calling set_virtual_refs_arr with the cleaned view. A cheap numpy-level check skips the virtual-refs call entirely for all-inlined or all-missing manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update broadcast description to reflect bytes replication

Broadcasting inlined chunks not only prepends singleton dims to their keys, but also replicates the bytes (by reference) across every position of an expanded axis, per the fix in #938.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
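
The partitioning step described in the second commit above can be sketched without touching icechunk at all. `split_for_writer` is a hypothetical helper over plain dictionaries; it only shows how inlined positions are separated out, how the sentinel is replaced by an empty string, and how the "anything virtual left?" check could work. It makes no icechunk calls and assumes, as the commit message implies, that an empty path marks a missing chunk.

```python
INLINED_CHUNK_PATH = "__inlined__"

Key = tuple[int, ...]


def split_for_writer(
    paths: dict[Key, str], inlined: dict[Key, bytes]
) -> tuple[dict[Key, bytes], dict[Key, str], bool]:
    """Separate inlined chunks from virtual references before writing.

    Returns (native_writes, cleaned_paths, has_virtual).
    """
    # Chunks to write as native data (the commit does this via store.set).
    native_writes = {
        key: inlined[key] for key, path in paths.items() if path == INLINED_CHUNK_PATH
    }
    # The sentinel is not a valid URL, so blank it out like a missing chunk.
    cleaned_paths = {
        key: ("" if path == INLINED_CHUNK_PATH else path) for key, path in paths.items()
    }
    # If nothing virtual remains, the virtual-refs call can be skipped entirely.
    has_virtual = any(path != "" for path in cleaned_paths.values())
    return native_writes, cleaned_paths, has_virtual


paths = {(0,): INLINED_CHUNK_PATH, (1,): "s3://bucket/file.nc", (2,): ""}
native_writes, cleaned_paths, has_virtual = split_for_writer(paths, {(0,): b"abc"})
```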

feat: generalize ChunkManifest to hold native chunks

54e9b48

maxrjones temporarily deployed to test-release March 20, 2026 02:19 — with GitHub Actions Inactive

maxrjones mentioned this pull request Mar 20, 2026

refactor: add iteration helpers to ChunkManifest #939

Merged

8 tasks

TomNicholas reviewed Mar 20, 2026


Comment thread docs/inlined_references.md Outdated

TomNicholas requested changes Mar 20, 2026


TomNicholas added the internals label Mar 20, 2026

maxrjones added 7 commits March 20, 2026 13:55

Rename native to inlined

1f1ead8

Move docs to explanation

13adb46

Rename data to inlined_data

8516604

Better sentinel values

04a420f

Improve required entry validation

e4ebc28

Add scalar test

8fc3de3

Revert changes that should be a separate PR

9223350

maxrjones temporarily deployed to test-release March 20, 2026 18:30 — with GitHub Actions Inactive

maxrjones self-assigned this Apr 3, 2026

maxrjones mentioned this pull request Apr 20, 2026

ODD PI 26.2 Objective 5: 🤪 Expand virtualization support for quirky datasets NASA-IMPACT/veda-odd#308

Open

5 tasks

Merge branch 'main' into store-native-chunks

6e37005

maxrjones temporarily deployed to test-release April 20, 2026 19:40 — with GitHub Actions Inactive

Fix mypy: avoid narrowing StringDType on np.where reassignment

c97cf39

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TomNicholas temporarily deployed to test-release April 22, 2026 22:45 — with GitHub Actions Inactive

Revert icechunk writer changes; handle inlined chunks in a follow-up PR

d7b0abd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TomNicholas temporarily deployed to test-release April 23, 2026 17:33 — with GitHub Actions Inactive

Move inlined chunks docs into data_structures.md

e75c7f7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TomNicholas temporarily deployed to test-release April 23, 2026 17:46 — with GitHub Actions Inactive

TomNicholas and others added 9 commits April 23, 2026 14:15

Add bytes-identity test for broadcasting inlined chunks

90aeeee

Confirms replicated entries share the same bytes object rather than allocating copies at each expanded position. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Smoke test that to_virtual_variable preserves inlined chunks

06345da

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5122636

for more information, see https://pre-commit.ci

pre-commit-ci Bot temporarily deployed to test-release April 23, 2026 19:27 Inactive

Merge branch 'main' into store-native-chunks

f328b11

maxrjones temporarily deployed to test-release April 23, 2026 21:25 — with GitHub Actions Inactive

TomNicholas changed the title ~~feat: generalize ChunkManifest to hold native chunks~~ feat: generalize ChunkManifest to hold inline chunks Apr 23, 2026

Merge branch 'main' into store-native-chunks

a51602c

TomNicholas temporarily deployed to test-release April 24, 2026 14:23 — with GitHub Actions Inactive

TomNicholas temporarily deployed to test-release April 24, 2026 15:08 — with GitHub Actions Inactive

TomNicholas and others added 2 commits April 24, 2026 11:11

Add release note for inlined chunks support

de587a9

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TomNicholas temporarily deployed to test-release April 24, 2026 15:19 — with GitHub Actions Inactive

TomNicholas merged commit 3194b09 into zarr-developers:main Apr 24, 2026
17 checks passed

This was referenced Apr 24, 2026

Support inlined Kerchunk data using obstore MemoryStore? #636

Closed

feat: parse kerchunk inline refs into inlined ChunkManifest entries #979

Merged
